[Core][TaskEvents] Add Integration Tests for Task Event Generation#57636
[Core][TaskEvents] Add Integration Tests for Task Event Generation#57636jjyao merged 24 commits intoray-project:masterfrom
Conversation
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
| event["taskDefinitionEvent"]["taskType"] == "DRIVER_TASK" | ||
| and event["taskDefinitionEvent"]["jobId"] != test_job_id | ||
| ): | ||
| driver_task_id = event["taskDefinitionEvent"]["taskId"] |
There was a problem hiding this comment.
I think we should just test it with the field renaming turned off
There was a problem hiding this comment.
Good point. The PR was written before the field renaming. Will the add the test for both and when we remove the env var, we can remove the field renaming in the test.
There was a problem hiding this comment.
I think just testing the renaming disabled is fine since you have other tests to check to renaming but up to you.
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
jjyao
left a comment
There was a problem hiding this comment.
Let's figure out why it takes a long time to run. Ideally CI tests should be fast.
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
I dig deeper into the tests and did some fixes to make the tests more stable. At the same time, I had the following findings from the tests and created followup issues for them:
|
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
| def run_driver_script_and_wait_for_events(script, httpserver, cluster, validation_func): | ||
| httpserver.expect_request("/", method="POST").respond_with_data("", status=200) | ||
| node_ids = [node.node_id for node in cluster.list_all_nodes()] | ||
| assert wait_until_grpc_channel_ready(cluster.gcs_address, node_ids) |
There was a problem hiding this comment.
I think you can use the existing wait_until_server_available util?
There was a problem hiding this comment.
Yeah I looked at the wait_until_server_available before but from what I understand, it only checks whether a port is ready but not the GRPC server is ready for connection for not.
| # before start the driver script. A longer term fix is to improve the start up | ||
| # sequence of the dashboard agent and the workers. | ||
| # Followup issue: https://github.com/ray-project/ray/issues/58007 | ||
| time.sleep(3) |
There was a problem hiding this comment.
since we already wait for the dashboard agent server to be up with wait_until_grpc_channel_ready, why do we need to sleep again?
There was a problem hiding this comment.
My understanding what happened is:
- When the test cluster started, we do ray.init() right away. This creates the driver process along with a core worker process.
- The processes starts to send the events to the aggregator agent every 100ms, even if there is no events (this is fixed in the latest commit.)
- At the time when the core worker starts, the aggregator agent might not ready to receive grpc connection, so the first sends failed. At the same time, the GRPC client's backoff retry strategy kicks in, so in the background, the connection will be retried only after the retry interval completes.
- So it could happen that the aggregator agent server is ready to receive connection but the retry hasn't happened yet. And the wait here was mainly to wait for the retry to happen and the core worker has successfully create the connection.
At the same time, as mentioned in 2, I task buffer's updated to logic to only send grpc requests to aggregator agent when there are events to send. This should eliminate the issue. So I removed the sleep in the latest commit.
|
just add myself so the PR shown up on my github review todo list, not blocking |
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
…ay-project#57636) Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: xgui <xgui@anyscale.com>
…ay-project#57636) Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
…ay-project#57636) Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: Aydin Abiar <aydin@anyscale.com>
…ay-project#57636) Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: Future-Outlier <eric901201@gmail.com>
Why are these changes needed?
This PR add integration tests for task status update events.
Related issue number
N/A
Checks
git commit -s) in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.